RabiArmany_Final

Author

Rabi Armany

Published

August 19, 2024

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(gganimate)

Final Project: CO2 Emissions by Country

Data Description

In this project, I examine the CO2 Emissions Estimates data from the UN database. This data set has four main components: the country; the year the data is estimated for (1975, 1985, 2005, 2010, 2015, 2018, 2019, and 2020); the total emissions of that country in that year (in thousand metric tons of carbon dioxide); and the emissions per capita (in metric tons of carbon dioxide). It also includes footnotes and a source. Though this data is valuable on its own, it leaves room for further analysis. The main questions I wanted to answer were whether there is a relationship between GDP per capita and CO2 emissions, and which countries have had the largest increases and decreases in per capita emissions since the beginning of data estimation.

CO2 <- read_csv("CO2_Emissions.csv", skip = 1) #reads in clean version of UN CO2 emissions
New names:
Rows: 2264 Columns: 7
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): ...2, Series, Footnotes, Source dbl (2): Region/Country/Area, Year num
(1): Value
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...2`
head(CO2, 10)
# A tibble: 10 × 7
   `Region/Country/Area` ...2     Year Series             Value Footnotes Source
                   <dbl> <chr>   <dbl> <chr>              <dbl> <chr>     <chr> 
 1                     8 Albania  1975 Emissions (thous… 4524   <NA>      Inter…
 2                     8 Albania  1985 Emissions (thous… 7145   <NA>      Inter…
 3                     8 Albania  2005 Emissions (thous… 3980   <NA>      Inter…
 4                     8 Albania  2010 Emissions (thous… 4074   <NA>      Inter…
 5                     8 Albania  2015 Emissions (thous… 3975   <NA>      Inter…
 6                     8 Albania  2018 Emissions (thous… 4525   <NA>      Inter…
 7                     8 Albania  2019 Emissions (thous… 4200   <NA>      Inter…
 8                     8 Albania  2020 Emissions (thous… 3512   <NA>      Inter…
 9                     8 Albania  1975 Emissions per ca…    1.8 <NA>      Inter…
10                     8 Albania  1985 Emissions per ca…    2.3 <NA>      Inter…
UNGDP <- read_csv("UNGDP.csv")
New names:
Rows: 6776 Columns: 7
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(7): T13, Gross domestic product and gross domestic product per capita, ...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
• `` -> `...7`
head(UNGDP, 10) #preview the data
# A tibble: 10 × 7
   T13                 Gross domestic product an…¹ ...3  ...4  ...5  ...6  ...7 
   <chr>               <chr>                       <chr> <chr> <chr> <chr> <chr>
 1 Region/Country/Area <NA>                        Year  Seri… Value Foot… Sour…
 2 1                   Total, all countries or ar… 1995  GDP … 31,2… <NA>  Unit…
 3 1                   Total, all countries or ar… 2005  GDP … 47,7… <NA>  Unit…
 4 1                   Total, all countries or ar… 2010  GDP … 66,5… <NA>  Unit…
 5 1                   Total, all countries or ar… 2015  GDP … 75,2… <NA>  Unit…
 6 1                   Total, all countries or ar… 2019  GDP … 87,7… <NA>  Unit…
 7 1                   Total, all countries or ar… 2020  GDP … 85,3… <NA>  Unit…
 8 1                   Total, all countries or ar… 2021  GDP … 96,6… <NA>  Unit…
 9 1                   Total, all countries or ar… 1995  GDP … 5,446 <NA>  Unit…
10 1                   Total, all countries or ar… 2005  GDP … 7,287 <NA>  Unit…
# ℹ abbreviated name:
#   ¹​`Gross domestic product and gross domestic product per capita`

Though these previews don’t paint a complete picture of the data, they provide a rudimentary synopsis of the variables each data set holds. Both data sets are messy: columns are unnamed, the GDP file’s header row is read in as data, and its numbers are stored as text with commas. They need to be cleaned before these questions can be answered.

Data Transformation

The CO2 data will be easier to understand if there are separate columns for emissions per capita and total emissions by country. Next, we can move those columns to directly after the “Year” column. We can drop the country code as well, since it isn’t needed to answer our questions. We must also rename the columns. Finally, the NAs in the footnotes can be replaced with “None”.

CO2_wider <- CO2 |>
  pivot_wider(names_from = 4, values_from = 5) |> #makes new columns for emissions per capita and total emissions by country
  relocate(`Emissions (thousand metric tons of carbon dioxide)`, `Emissions per capita (metric tons of carbon dioxide)`, .after = `Year`) |> #moves emissions to directly after year
  select(2:7) #gets rid of country code, not necessary since code doesn't contain useful information

CO2_Named <- setNames(CO2_wider, c("Country", "Year", "Emissions (thousand metric tons of carbon dioxide)", "Emissions per capita (metric tons of carbon dioxide)", "Footnotes", "Source")) 
#names columns based off CO2_wider

CO2Clean <- CO2_Named |>
    mutate(Footnotes=replace_na(Footnotes, "None")) #replaces NA values with "None" for no footnotes
head(CO2Clean)
# A tibble: 6 × 6
  Country  Year Emissions (thousand me…¹ Emissions per capita…² Footnotes Source
  <chr>   <dbl>                    <dbl>                  <dbl> <chr>     <chr> 
1 Albania  1975                     4524                    1.8 None      Inter…
2 Albania  1985                     7145                    2.3 None      Inter…
3 Albania  2005                     3980                    1.3 None      Inter…
4 Albania  2010                     4074                    1.4 None      Inter…
5 Albania  2015                     3975                    1.3 None      Inter…
6 Albania  2018                     4525                    1.5 None      Inter…
# ℹ abbreviated names: ¹​`Emissions (thousand metric tons of carbon dioxide)`,
#   ²​`Emissions per capita (metric tons of carbon dioxide)`
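
As a minimal sketch of the long-to-wide step, the example below pivots a small toy table by column name rather than by column position, which keeps the call working if the column order ever changes. The table `co2_long` and its shortened series labels are hypothetical stand-ins for the real file; only the four Albania values are copied from the preview above.

```r
library(tidyr)
library(tibble)

# Hypothetical toy table in the same long layout as the raw CO2 file
co2_long <- tribble(
  ~Country,  ~Year, ~Series,                ~Value,
  "Albania", 1975,  "Emissions total",      4524,
  "Albania", 1975,  "Emissions per capita", 1.8,
  "Albania", 2005,  "Emissions total",      3980,
  "Albania", 2005,  "Emissions per capita", 1.3
)

# One row per country-year, one column per series
co2_wide <- pivot_wider(co2_long, names_from = Series, values_from = Value)
co2_wide
```

Referring to `Series` and `Value` by name is only one option; the positional form used in this report gives the same result as long as the columns stay in the same order.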

This data is much easier to work with. Now for the GDP data. We can start by renaming the columns, using “x” as a placeholder name for the columns we want to drop, and then removing them. Since the most recent year common to the GDP data and the CO2 data is 2020, we’ll zero in on that year. Finally, we only need GDP per capita, not total GDP.

GDP_named2020 <- setNames(UNGDP, c("x", "Country", "Year", "Series", "Value", "x", "x")) |> #renames columns
  select(!contains("x")) |> #removes unwanted columns
  filter(str_detect(Year, "2020")) |> #filters to year 2020
  filter(str_detect(Series, "GDP per capita")) #focuses on GDP per capita
head(GDP_named2020)
# A tibble: 6 × 4
  Country                       Year  Series                      Value 
  <chr>                         <chr> <chr>                       <chr> 
1 Total, all countries or areas 2020  GDP per capita (US dollars) 10,883
2 Africa                        2020  GDP per capita (US dollars) 1,799 
3 Northern Africa               2020  GDP per capita (US dollars) 3,039 
4 Sub-Saharan Africa            2020  GDP per capita (US dollars) 1,518 
5 Eastern Africa                2020  GDP per capita (US dollars) 951   
6 Middle Africa                 2020  GDP per capita (US dollars) 1,054 
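
The `Value` column printed above is still character data with thousands separators. The sketch below compares three ways of turning such strings into numbers. The vector `values` is illustrative: its first two entries come from the preview, and the third is a made-up value with two commas, included to show where removing only the first comma falls short.

```r
library(readr)
library(stringr)

# "1,234,567" is a made-up value used only to show the multi-comma case
values <- c("10,883", "951", "1,234,567")

# str_replace() removes only the first comma in each string
first_only <- str_replace(values, ",", "")

# str_replace_all() removes every comma, so as.numeric() succeeds
all_commas <- as.numeric(str_replace_all(values, ",", ""))

# parse_number() treats "," as a grouping mark by default
parsed <- parse_number(values)

first_only
all_commas
parsed
```

For GDP per capita figures, which have at most one comma, all three approaches agree; the difference only matters for values of a million or more.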

Now we can begin to transform the data and create a data set that contains the pertinent information from both the GDP data and the CO2 data. We start by creating a CO2 data set that only contains the year 2020, as that is the most recent year common to the CO2 and GDP data sets. We can then merge the data sets by the “Country” column. Because the join keeps only the rows present in the CO2 data, this also drops the regions, totals, and other GDP rows that we don’t need. Next, we rename the columns of the new data set, again using x’s as placeholders for columns we don’t need, remove those columns, and relocate the remaining ones. The commas in the GDP values need to be removed and the values converted from characters to numbers; a new column is created for these numbers. Finally, certain country names are read in with special characters that R does not display correctly, so they need to be replaced.

CO2Clean2020 <- CO2Clean |>
  filter(str_detect(Year, "2020")) #most recent common year between CO2 and GDP

CO2GDP_clean <- left_join(CO2Clean2020, GDP_named2020, by = "Country") #joins CO2 and GDP by country
CO2GDP_clean <- setNames(CO2GDP_clean, c("Country", "Year", "Emissions (thousand metric tons of carbon dioxide)", "Emissions per capita (metric tons of carbon dioxide)", "x", "source", "x", "x", "GDP per capita (US dollars)")) |> #sets names
  select(!starts_with("x")) |> #removes x columns
  relocate("GDP per capita (US dollars)", .after = "Year") #relocates GDP

CO2GDP_clean <- CO2GDP_clean |>
  mutate(GDP = as.numeric(str_replace_all(`GDP per capita (US dollars)`, ",", "")))  #turns GDP into numeric value, removing every comma

CO2GDP_clean$Country[29] <- "Côte d'Ivoire" #replaces garbled name with correct spelling (row index is hard-coded)
CO2GDP_clean$Country[32] <- "Curaçao" #replaces garbled name with correct spelling (row index is hard-coded)
CO2GDP_clean$Country[135] <- "Türkiye" #replaces garbled name with correct spelling (row index is hard-coded)

head(CO2GDP_clean)
# A tibble: 6 × 7
  Country    Year `GDP per capita (US dollars)` Emissions (thousand metric ton…¹
  <chr>     <dbl> <chr>                                                    <dbl>
1 Albania    2020 5,278                                                     3512
2 Algeria    2020 3,354                                                   135599
3 Angola     2020 1,640                                                    16939
4 Argentina  2020 8,561                                                   150666
5 Armenia    2020 4,506                                                     6464
6 Australia  2020 55,774                                                  378417
# ℹ abbreviated name: ¹​`Emissions (thousand metric tons of carbon dioxide)`
# ℹ 3 more variables:
#   `Emissions per capita (metric tons of carbon dioxide)` <dbl>, source <chr>,
#   GDP <dbl>
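
One way to check a join like the one above is to list the rows that found no partner. The sketch below does this with `anti_join()` on two hypothetical toy tables, `co2_toy` and `gdp_toy`. The spelling mismatch between "Vietnam" and "Viet Nam" is invented for illustration and is not taken from the UN files, and the emissions and Viet Nam GDP values are made up.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables; the spelling mismatch is deliberate
co2_toy <- tibble(
  Country    = c("Albania", "Algeria", "Vietnam"),
  per_capita = c(1.2, 3.1, 3.0)
)
gdp_toy <- tibble(
  Country = c("Albania", "Algeria", "Viet Nam"),
  GDP     = c(5278, 3354, 3500)
)

# Rows of the CO2 table with no matching country in the GDP table
unmatched <- anti_join(co2_toy, gdp_toy, by = "Country")

# A left join keeps those rows but fills their GDP with NA
joined <- left_join(co2_toy, gdp_toy, by = "Country")
n_missing_gdp <- sum(is.na(joined$GDP))

unmatched
n_missing_gdp
```

Any country that appears in `unmatched` would show up in the merged data with a missing GDP, which is worth knowing before plotting.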

Next, in order to determine which countries have had the largest changes in CO2 emissions, a new column must be created to highlight those differences. To do this, we can keep only the years 2020 and 1975 and then create a new column for each of those years. From there, we can create a column with the difference between them. Finally, two data sets can be created: one with the 10 countries with the largest increase in emissions per capita, and another with the 10 countries with the largest decrease. We can reorder these so that, when graphed, they are displayed in a more visually appealing manner.

CO2_Wide_years <- CO2Clean |>
  filter(Year %in% c(1975, 2020)) |> #keeps only the years 1975 and 2020
  select(!contains("thousand")) |> #drops total emissions, leaving emissions per capita
  pivot_wider(names_from = "Year", values_from = "Emissions per capita (metric tons of carbon dioxide)") |> #pivots wider to create new columns for 1975 and 2020
  mutate(Difference = `2020` - `1975`) #creates new column with difference between 2020 and 1975

CO2shortH <- CO2_Wide_years |>
  arrange(desc(Difference)) |> #arrange difference by highest to lowest (NA values sort last)
  head(10) |> #top 10 countries with the highest increase in emissions per capita
  mutate(Country = fct_reorder(Country, Difference))  #helps bar graph be in order

CO2shortT <- CO2_Wide_years |>
  drop_na(Difference) |> #drops NA values
  arrange(Difference) |> #arrange difference from lowest to highest
  head(10)  #top 10 countries with the highest decrease in emissions per capita
  
CO2shortT$Country[1] = "Curaçao" #changes special characters in Curaçao to readable format
CO2shortT <- CO2shortT |>
  mutate(Country = fct_reorder(Country, Difference)) #helps bar graph be in order
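
The top-10 and bottom-10 selections can also be sketched with `slice_max()` and `slice_min()` from dplyr. This is an alternative formulation rather than the code used in this report, and the table `changes` with its country labels A to E is entirely made up.

```r
library(dplyr)
library(tibble)

# Hypothetical changes in emissions per capita; NA marks a missing year
changes <- tibble(
  Country    = c("A", "B", "C", "D", "E"),
  Difference = c(2.5, -4.0, NA, 0.3, -1.2)
)

# Drop missing differences first so they cannot enter either list
complete <- filter(changes, !is.na(Difference))

# Two largest increases, ordered from largest to smallest
top_increase <- slice_max(complete, Difference, n = 2)

# Two largest decreases, ordered from most negative upward
top_decrease <- slice_min(complete, Difference, n = 2)

top_increase
top_decrease
```

With the real data, `n = 10` would give the same kind of lists as the `arrange()` and `head()` combination, with ties kept by default.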

Now both data sets are in the form needed for visualization. Since there are 150 countries to plot, a scatter plot with a geom_smooth line is the most efficient way to examine the relationship between GDP per capita and emissions per capita, and it helps us see patterns in the data. A simple bar chart for each list will display the 10 countries with the largest increase and the largest decrease in per capita CO2 emissions.

GDPgraph <- ggplot(CO2GDP_clean, aes(x = GDP, y = `Emissions per capita (metric tons of carbon dioxide)`, label = Country)) + 
  geom_point() + #adds point graph 
  geom_smooth() +  #adds smooth line
  labs(title = "GDP per capita and emissions per capita", x = "GDP per capita (US dollars)", y = "Emissions per Capita (metric tons of carbon dioxide)") + #creates labels
  theme(axis.title.y = element_text(size=8)) #changes y axis title size


Increase <- ggplot(CO2shortH, aes(x = Country, y = as.numeric(Difference))) + #uses CO2shortH to show countries with highest increase
  geom_col() + #creates bar/column graph
  theme(axis.text.x = element_text(angle = 55, hjust = 1)) + #angles x axis labels so country names fit
  labs(title = "Countries with highest increase in emissions per capita", x = "Country", y = "Change in Emissions per Capita (metric tons of carbon dioxide)") + #creates labels
  theme(axis.title.y = element_text(size=5)) #changes y axis title size



Decrease <- ggplot(CO2shortT, aes(x = Country, y = as.numeric(Difference))) + #uses CO2shortT to show countries with highest decrease
  geom_col() + #creates bar/column graph
  theme(axis.text.x = element_text(angle = 55, hjust = 1)) + #angles x axis labels so country names fit
  labs(title = "Countries with highest decrease in emissions per capita", x = "Country", y = "Change in Emissions per Capita (metric tons of carbon dioxide)") + #creates labels
  theme(axis.title.y = element_text(size=5)) #changes y axis title size

Analysis and Visualization

GDP per capita and CO2 Emissions

ggplotly(GDPgraph)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).
Warning: The following aesthetics were dropped during statistical transformation: label.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

This is the relationship between GDP per capita in US dollars and emissions per capita in metric tons of carbon dioxide. The general trend of emissions per capita as a function of GDP per capita is roughly logarithmic: a country’s emissions per capita generally increase as its GDP per capita increases, until GDP per capita reaches about $28,000, after which the curve begins to flatten out. The fit is far from exact, however; many countries sit well above or below the line. Note especially one outlier, Qatar, with a GDP per capita of $52,316 and emissions per capita of 29.2 metric tons of CO2. Other countries display interesting data as well. Luxembourg, all the way on the right, has a GDP per capita of $117,724 and emissions per capita of 11.8 metric tons of CO2. Australia and the United States have GDP per capita values that are slightly less and slightly more than half of Luxembourg’s, respectively, but both countries have higher per capita CO2 emissions than Luxembourg.
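
One way to probe a logarithmic trend like the one described above is to regress emissions on the logarithm of GDP. The sketch below does this on made-up numbers; the data frame `toy`, its noise term, and the coefficients 1.5 and 0.5 used to generate it are all assumptions for illustration and are not estimates from the UN data.

```r
# Made-up GDP per capita values and a small fixed noise term
toy <- data.frame(
  gdp   = c(1000, 5000, 10000, 30000, 60000, 100000),
  noise = c(0.2, -0.1, 0.3, -0.3, 0.1, -0.2)
)

# Emissions built to grow with log(GDP), so growth slows as GDP rises
toy$emissions <- 1.5 * log(toy$gdp / 1000) + 0.5 + toy$noise

# Linear model in log(GDP); a positive slope means emissions rise
# with GDP but flatten out at higher incomes
fit <- lm(emissions ~ log(gdp), data = toy)
slope <- unname(coef(fit)["log(gdp)"])
slope
```

Applied to the merged data, the same model form would use `GDP` and the emissions per capita column in place of the toy columns, and its fit could be compared against the loess curve drawn by geom_smooth.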

Highest Increase and Decrease in per capita CO2 Emissions

ggplotly(Decrease)
ggplotly(Increase)

These are the 10 countries with the largest decrease and the largest increase in emissions per capita. Curaçao had the largest decrease, at 47.3 metric tons per person. Gibraltar had the largest increase, at 16.9 metric tons per person. The median difference for all countries is 0.3. It was surprising to see which countries appear on each list. The United States is commonly thought of as a country enveloped by consumerist and wasteful tendencies, yet it has decreased its per capita emissions sharply. China, on the other hand, has had a comparatively much larger increase. Additionally, more of the countries with large increases in emissions per capita are typically considered underdeveloped, while all of the countries with large decreases are typically considered highly developed. That pattern is less surprising; countries with higher development indexes would likely have better technology for reducing emissions and stricter standards for emissions from manufacturing and industry.
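
A summary figure such as the median difference can be computed directly from the difference column. The sketch below shows one way to do that while ignoring missing values and counting how many countries rose or fell. The table `changes` is a made-up example, so its median is not the 0.3 reported above.

```r
library(dplyr)
library(tibble)

# Hypothetical changes in emissions per capita; NA marks a missing year
changes <- tibble(
  Country    = c("A", "B", "C", "D", "E"),
  Difference = c(2.5, -4.0, NA, 0.3, -1.2)
)

# na.rm = TRUE keeps countries with a missing year from turning the result into NA
change_summary <- changes |>
  summarise(
    median_change = median(Difference, na.rm = TRUE),
    n_increase    = sum(Difference > 0, na.rm = TRUE),
    n_decrease    = sum(Difference < 0, na.rm = TRUE),
    n_missing     = sum(is.na(Difference))
  )
change_summary
```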

Reflection

This data offers valuable insights into different countries and their CO2 emissions from 1975 until 2020. On its own, the data can reveal which countries have had the largest increase and decrease in emissions since 1975, as well as which countries currently have the highest and lowest emissions. When merged with data on GDP, the data can display CO2 emissions as a function of GDP, revealing an initial sharp increase in emissions as GDP increases, followed by a logarithmic tapering off.

These revelations, though valuable, leave much to be desired. How does the authority of a government dictate how much CO2 a country produces? How does technological development play a role in emissions? What do the expected consequences of different levels of per capita emissions look like? Which policies have the largest impact on CO2 emissions? In order to answer these questions, we would need more data on technological development, different levels of government intervention in both emissions and the market, climate modeling data, and detailed policy data. Much of this data is, however, obtainable through the UN database, so these questions can, in all likelihood, be explored further.

Bibliography

Wickham, Hadley, et al. “R for Data Science (2E).” R for Data Science (2e), r4ds.hadley.nz/. Accessed 16 Aug. 2024.

Long, James (JD), and Paul Teetor. “R Cookbook, 2nd Edition.” R Cookbook, 2nd Edition, 26 Sept. 2019, rc2e.com/.

R Core Team. “R: A Language and Environment for Statistical Computing.” The R Project for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2021, www.R-project.org.

“Undata.” United Nations, United Nations, data.un.org/. Accessed 16 Aug. 2024.